Residual Tokens Enhance Masked Autoencoders For Speech Modeling
Abstract
Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments supervised attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, emotion). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.
Summary
- Explicit / Residual separation
- Speech generation is decomposed into:
- explicit attributes (e.g., pitch, linguistic content, speaker identity) for interpretable control,
- continuous residual tokens capturing phenomena not modeled by attributes.
- Residual tokens via cross-attention
- Inspired by the Perceiver, a fixed set of learnable queries extracts residual information from the spectrogram.
- Provides a compact and controlled representation independent of sequence length.
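The Perceiver-style bottleneck above can be sketched as a single cross-attention step: a fixed set of query vectors attends over however many spectrogram frames there are, always producing the same number of residual tokens. This is a minimal NumPy illustration; the projection matrices, multi-head structure, and dimensions of the actual model are assumptions, not details from the paper.

```python
import numpy as np

def cross_attention_pool(queries, frames):
    """Single-head cross-attention: fixed learnable queries attend over frames.

    queries: (n_tokens, d) -- hypothetical learned residual-token queries
    frames:  (T, d)        -- spectrogram frame embeddings, any length T
    Returns (n_tokens, d): one residual token per query, independent of T.
    """
    d = queries.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)             # (n_tokens, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over frames
    return weights @ frames                              # (n_tokens, d)

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))                       # 4 residual tokens, dim 16
short = cross_attention_pool(queries, rng.normal(size=(50, 16)))
long_ = cross_attention_pool(queries, rng.normal(size=(500, 16)))
print(short.shape, long_.shape)                          # same shape for both lengths
```

Because the output shape depends only on the number of queries, the representation stays compact and fixed-size regardless of utterance duration, which is the property the bullet above highlights.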
- Complementarity and flexibility
- Residual tokens enrich explicit attributes, enabling speech that is both more natural and controllable.
- Dropout-based regularization
- Targeted dropout on residual tokens prevents over-reliance.
- Enforces effective use of explicit attributes, ensuring interpretability and controllability.
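The targeted dropout described above can be sketched as dropping whole residual tokens (rather than individual features) during training, so the decoder cannot lean on them exclusively and must also exploit the explicit attributes. The rate and the use of inverted-dropout rescaling are assumptions of this sketch, not values from the paper.

```python
import numpy as np

def residual_token_dropout(tokens, p, rng, training=True):
    """Drop entire residual tokens with probability p during training.

    tokens: (n_tokens, d). Zeroing whole tokens forces the decoder
    to fall back on explicit attributes for the missing information.
    Uses standard inverted-dropout scaling (an assumption here).
    """
    if not training or p == 0.0:
        return tokens
    keep = rng.random(tokens.shape[0]) >= p              # (n_tokens,) keep-mask
    return tokens * keep[:, None] / (1.0 - p)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
train_out = residual_token_dropout(tokens, p=0.5, rng=rng)
eval_out = residual_token_dropout(tokens, p=0.5, rng=rng, training=False)
print(train_out.shape)                                   # unchanged shape
```

At inference the tokens pass through untouched, so controllability from explicit attributes is preserved while residual tokens remain available when desired.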
- Unified MAE-based architecture
- Built upon the MAE paradigm: discrete tokens for spectrograms and attributes, continuous tokens for residuals, with Transformer encoding and HiFi-GAN decoding.
- Partial masking during training encourages intra- and inter-modal dependency learning.
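The partial masking mentioned above can be sketched as standard MAE-style random masking: a fraction of token positions is hidden, and the model must reconstruct them from what remains, which encourages it to learn dependencies within and across modalities. The mask ratio and a uniform per-position policy are assumptions of this sketch; the paper's per-modality masking schedule is not given in this summary.

```python
import numpy as np

def random_mask(n_positions, mask_ratio, rng):
    """Choose which token positions to mask, MAE-style.

    Returns a boolean array of shape (n_positions,); True = masked.
    The encoder sees only unmasked tokens and must reconstruct the rest.
    """
    n_mask = int(round(n_positions * mask_ratio))
    idx = rng.permutation(n_positions)[:n_mask]          # random subset to hide
    mask = np.zeros(n_positions, dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
mask = random_mask(100, 0.4, rng)
print(int(mask.sum()))                                   # → 40
```

In the full model this mask would be drawn over the concatenated spectrogram, attribute, and residual token streams, so reconstruction targets can come from any modality.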
👉 RT-MAE introduces continuous residual tokens extracted via cross-attention and regularized by dropout, combining controllability from explicit attributes with flexibility from residuals within an MAE framework for speech generation.